We define the variables needed to retrieve the data from the Zenodo page, create the directory where the data will be stored, and download the files directly.
2022-10-16 21:24:34,809 - Process Data - INFO - Connectome_ASI_050219.xlsx downloaded
2022-10-16 21:24:37,577 - Process Data - INFO - Connectome_BIS_050219.xlsx downloaded
2022-10-16 21:24:40,005 - Process Data - INFO - Connectome_CCQN_050219.xlsx downloaded
2022-10-16 21:24:42,658 - Process Data - INFO - Connectome_CGI_050219.xlsx downloaded
2022-10-16 21:24:45,915 - Process Data - INFO - Connectome_demographics_050219.xlsx downloaded
2022-10-16 21:24:48,815 - Process Data - INFO - Connectome_DERS_050219.xlsx downloaded
2022-10-16 21:24:52,798 - Process Data - INFO - Connectome_DES_050219.xlsx downloaded
2022-10-16 21:24:55,103 - Process Data - INFO - Connectome_DIB_050219.xlsx downloaded
2022-10-16 21:24:57,451 - Process Data - INFO - Connectome_DSS4_050219.xlsx downloaded
2022-10-16 21:25:00,794 - Process Data - INFO - Connectome_instantview_050219.xlsx downloaded
2022-10-16 21:25:09,597 - Process Data - INFO - Connectome_MINI_050219.xlsx downloaded
2022-10-16 21:25:13,497 - Process Data - INFO - Connectome_SCID_050219.xlsx downloaded
2022-10-16 21:25:16,491 - Process Data - INFO - Connectome_SCL90__050219.xlsx downloaded
2022-10-16 21:25:19,190 - Process Data - INFO - Connectome_WHODAS_050219.xlsx downloaded
2022-10-16 21:25:21,636 - Process Data - INFO - participants.xlsx downloaded
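The download cell itself is not shown on this page, only its log. The following is a minimal sketch of what such a step could look like, not the original code. The record id is a hypothetical placeholder, the `https://zenodo.org/record/<id>/files/<name>?download=1` URL pattern is an assumption, and the function names `build_file_url` and `download_files` are invented for illustration. The logger name and format were chosen to resemble the log lines above.

```python
import logging
import pathlib
import urllib.request

# Logger name and format chosen to resemble the log lines shown above.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger("Process Data")

# Hypothetical placeholder: replace with the real Zenodo record id.
RECORD_ID = "0000000"


def build_file_url(record_id, filename):
    """Build a direct-download URL (assumed Zenodo URL pattern)."""
    return f"https://zenodo.org/record/{record_id}/files/{filename}?download=1"


def download_files(record_id, filenames, target_dir, fetch=urllib.request.urlretrieve):
    """Create target_dir if needed and download each file into it.

    `fetch(url, destination)` is injectable so the function can be
    exercised without network access.
    """
    target = pathlib.Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for name in filenames:
        destination = target / name
        fetch(build_file_url(record_id, name), destination)
        logger.info("%s downloaded", name)
        saved.append(destination)
    return saved
```

With a real record id, a call such as `download_files(RECORD_ID, ["participants.xlsx"], "descargas")` would write the file into the `descargas` directory that the import code below reads from.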
Import Data
We can read the information directly from the Excel file and apply some preprocessing to the data. Note that some Excel files may need a specific engine to be read correctly: if the read_excel method cannot read the file directly, it is worth trying openpyxl.
That said, the baseline recommendation is to avoid closed formats and use csv, zipped csv, parquet, or hdf5 instead.
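As a rough illustration of both points (retrying with an explicit engine, and keeping a copy in an open format), here is a small sketch. The helper names `read_excel_with_fallback` and `export_csv_gz` are hypothetical, and the fallback assumes that pandas raises a `ValueError` when it cannot determine the engine on its own; other failure modes are not handled.

```python
import pandas as pd


def read_excel_with_fallback(path, sheet_name=0):
    """Try the default engine first, then retry with openpyxl.

    Assumption: a ValueError signals that pandas could not pick an engine.
    openpyxl must be installed for the retry to work.
    """
    try:
        return pd.read_excel(path, sheet_name=sheet_name)
    except ValueError:
        return pd.read_excel(path, sheet_name=sheet_name, engine="openpyxl")


def export_csv_gz(frame, stem):
    """Write a gzip-compressed CSV copy of a DataFrame and return its path."""
    csv_path = f"{stem}.csv.gz"
    frame.to_csv(csv_path, index=False, compression="gzip")
    return csv_path

# frame.to_parquet(...) is the parquet counterpart; it needs pyarrow or
# fastparquet installed, so it is left out of this sketch.
```

The compressed CSV can be read back with `pd.read_csv`, which infers gzip compression from the `.gz` extension.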
Code
import pathlib

import pandas as pd
import numpy as np

# Locate the downloaded workbook relative to the project root.
parent_path = pathlib.Path().parent.resolve().parent
demographics_file = parent_path.joinpath('descargas', 'Connectome_demographics_050219.xlsx')

# One sheet holds the data, the other the data dictionary.
demo_data = pd.read_excel(demographics_file, sheet_name="Demographics")
dictionary_data = pd.read_excel(demographics_file, sheet_name="Connectome_demographics")

# Coerce columns to usable types; values that cannot be parsed become NaN.
demo_data['income'] = pd.to_numeric(demo_data.income, errors='coerce')
demo_data['prof_mental'] = demo_data.prof_mental.astype('category')
demo_data['support_years'] = pd.to_numeric(demo_data.support_years, errors='coerce')
# work_threeyears mixes decimal commas and points, so normalize before converting.
demo_data['work_threeyears'] = pd.to_numeric(
    [str(value).replace(',', '.') for value in demo_data.work_threeyears],
    errors='coerce',
)
demo_data.head()
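|   | rid | group | demo | sex | age | educ | occup | income | civil_st | child | ... | work_thirtydays | work_threeyears | energy_freq | energy_recent | energy_cans | laterality | ed_score | amai | amai_score | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1.0 | 41.0 | 7.0 | 7.0 | 20000.0 | 1.0 | 0.0 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 87.5 | 7.0 | 222.0 | NaN |
| 1 | 2 | 2 | 1 | 1.0 | 23.0 | 5.0 | 3.0 | 7100.0 | 6.0 | 0.0 | ... | 1.0 | 4.0 | 0.0 | 0.0 | 0.0 | 1.0 | 87.5 | 3.0 | 104.0 | NaN |
| 2 | 3 | 2 | 1 | 1.0 | 27.0 | 5.0 | 4.0 | 12000.0 | 6.0 | 0.0 | ... | 4.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | -87.5 | 7.0 | 219.0 | NaN |
| 3 | 4 | 1 | 1 | 1.0 | 27.0 | 5.0 | 7.0 | 10000.0 | 6.0 | 0.0 | ... | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 100.0 | 4.0 | 110.0 | NaN |
| 4 | 5 | 1 | 1 | 1.0 | 23.0 | 5.0 | 7.0 | 6000.0 | 6.0 | 0.0 | ... | 2.0 | 4.0 | 0.0 | 0.0 | 0.0 | 1.0 | 100.0 | 3.0 | 94.0 | NaN |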
5 rows × 28 columns
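To make the effect of errors='coerce' and the decimal-comma replacement in the import code concrete, here is a small sketch on made-up values (the `raw` series is hypothetical, not taken from the dataset). Entries that still cannot be parsed after the replacement are turned into NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical raw values, mimicking a column typed by hand in Excel.
raw = pd.Series(["1", "2,5", "3.0", "n/a", None])

# Replace the decimal comma, then coerce: unparseable entries become NaN.
cleaned = pd.to_numeric(
    [str(value).replace(",", ".") for value in raw],
    errors="coerce",
)
print(cleaned)
```

The trade-off is that coercion silently hides bad entries, so it is worth comparing the number of missing values before and after the conversion.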
EDA Tools
Skim
This tool exists in both R and Python, is simple to use, and integrates seamlessly into a Jupyter notebook.
Code
from skimpy import skim

skim(demo_data)
/home/nekrum/proyectos/datavix_lanirem/sudmex_conn/python/.env/lib/python3.10/site-packages/numpy/lib/histograms.py:906: RuntimeWarning: invalid value encountered in divide
return n/db/n.sum(), bin_edges
SweetViz
This tool generates an HTML report that can be embedded in the notebook through widgets or an iframe. Its interface is somewhat more polished and is interactive. It also has an associations section for reviewing the relationships between variables.
Code
import sweetviz as sv
import warnings

warnings.filterwarnings('ignore')

result = sv.analyze(demo_data)
result.show_notebook()
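For readers without the interactive report at hand, the idea behind an associations view can be approximated with plain pandas. This is only a sketch: it computes pairwise Pearson correlations between numeric columns, and it is not a reproduction of SweetViz's own computation. The toy frame reuses the age and income values of the five rows shown earlier; the variable names are illustrative.

```python
import pandas as pd

# Toy data: age and income from the five rows displayed by demo_data.head().
toy = pd.DataFrame(
    {
        "age": [41, 23, 27, 27, 23],
        "income": [20000, 7100, 12000, 10000, 6000],
        "sex": ["1", "1", "1", "1", "1"],  # non-numeric column, ignored below
    }
)

# Keep numeric columns only, then compute pairwise Pearson correlations.
associations = toy.select_dtypes("number").corr()
print(associations.round(2))
```

With only five rows the coefficient says little about the full dataset; the point is the shape of the result, a symmetric matrix with one row and column per numeric variable.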
Notes
Several packages offer similar functionality and go beyond info() or describe(); these tools focus on giving a quick view of variable distributions, missing values, and even relationships between variables. One tool I considered including is pandas-profiling; I may add it in a future update, but for now it raises an error when loading pandas libraries.
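As a point of comparison, part of what these tools report can be assembled by hand from pandas primitives. The sketch below builds a per-column summary of dtype and missing values on a hypothetical toy frame; the column names are borrowed from the dataset, the values are made up.

```python
import pandas as pd

# Hypothetical toy frame with a numeric, an empty, and a categorical-like column.
toy = pd.DataFrame(
    {
        "age": [41.0, 23.0, None, 27.0],
        "notes": [None, None, None, None],
        "group": ["1", "2", "2", "1"],
    }
)

# One row per column: type, number of missing values, and their share.
summary = pd.DataFrame(
    {
        "dtype": toy.dtypes.astype(str),
        "missing": toy.isna().sum(),
        "missing_pct": (toy.isna().mean() * 100).round(2),
    }
)
print(summary)

# describe() adds basic distribution statistics for the numeric columns.
print(toy.describe())
```

The EDA packages add histograms, type inference, and interactivity on top of this kind of table, which is what makes them convenient for a first look.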
About Quarto
In this proof of concept, rendering a web page from this notebook was straightforward. Using the same parameters as in R in the initial chunk, the result is hard to tell apart from the R output. What I find relevant is that simply by inserting a code chunk at the start you can generate a PDF, a Word document, a presentation, or a web page. There are alternatives for exporting a Jupyter notebook, but having one tool that works in both languages and keeps a consistent style simplifies the work.
NOTE: The DataPrep section precedes this notes section; however, because of the way DataPrep's output is rendered, the styling of later sections is lost. This can be solved by using DataPrep's sections separately, but that was outside the scope of this proof of concept.
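The multi-format rendering described in these notes can also be driven from Python by calling the Quarto command line. The sketch below assumes that the `quarto` executable is on the PATH and uses a hypothetical notebook name, `eda.ipynb`; the helper names are illustrative. It relies on the `quarto render <file> --to <format>` form of the command, with formats such as html, pdf, docx, or revealjs.

```python
import shutil
import subprocess


def quarto_render_command(notebook, output_format):
    """Return the argument list for `quarto render <notebook> --to <format>`."""
    return ["quarto", "render", str(notebook), "--to", output_format]


def render(notebook, output_format="html"):
    """Run Quarto if it is installed; return True when rendering succeeded."""
    if shutil.which("quarto") is None:
        # Quarto is not available on this machine, so nothing is rendered.
        return False
    completed = subprocess.run(
        quarto_render_command(notebook, output_format), check=False
    )
    return completed.returncode == 0
```

For example, `render("eda.ipynb", "docx")` would ask Quarto for a Word document from the same notebook that produces the web page.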
DataPrep
Like SweetViz, this package generates a detailed report on the variables in the dataframe. As a bonus, it includes some preprocessing methods that can help speed up the analysis.
Code
from dataprep.eda import create_report

create_report(demo_data).show()
DataPrep Report
Overview
Dataset Statistics
Number of Variables: 28
Number of Rows: 139
Missing Cells: 839
Missing Cells (%): 21.6%
Duplicate Rows: 0
Duplicate Rows (%): 0.0%
Total Size in Memory: 31.3 KB
Average Row Size in Memory: 230.5 B
Variable Types
Numerical: 11
Categorical: 17
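The missing-cells percentage in the overview follows directly from the other figures: it is the number of missing cells divided by rows times columns. A short check, using only the numbers reported above:

```python
# Figures copied from the Dataset Statistics block of the report.
n_rows, n_columns, missing_cells = 139, 28, 839

total_cells = n_rows * n_columns
missing_pct = 100 * missing_cells / total_cells
print(f"{missing_cells} / {total_cells} cells missing = {missing_pct:.1f}%")

# On the DataFrame itself, the same ratio would be
# demo_data.isna().sum().sum() / demo_data.size
```

Note that the notes column, which is entirely empty, contributes 139 of the 839 missing cells on its own.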
Dataset Insights
rid is uniformly distributed [Uniform]
bro_num and work_threeyears have similar distributions [Similar Distribution]
sex has 2 (1.44%) missing values [Missing]
age has 2 (1.44%) missing values [Missing]
educ has 6 (4.32%) missing values [Missing]
occup has 8 (5.76%) missing values [Missing]
income has 33 (23.74%) missing values [Missing]
civil_st has 30 (21.58%) missing values [Missing]
child has 32 (23.02%) missing values [Missing]
child_num has 35 (25.18%) missing values [Missing]
bro_num has 40 (28.78%) missing values [Missing]
place_bro has 36 (25.9%) missing values [Missing]
prof_mental has 42 (30.22%) missing values [Missing]
years_mental has 43 (30.94%) missing values [Missing]
hosp_subst has 47 (33.81%) missing values [Missing]
support_ever has 40 (28.78%) missing values [Missing]
support_years has 41 (29.5%) missing values [Missing]
work_thirtydays has 58 (41.73%) missing values [Missing]
work_threeyears has 56 (40.29%) missing values [Missing]
energy_freq has 40 (28.78%) missing values [Missing]
energy_recent has 40 (28.78%) missing values [Missing]
energy_cans has 41 (29.5%) missing values [Missing]
laterality has 2 (1.44%) missing values [Missing]
ed_score has 4 (2.88%) missing values [Missing]
amai has 11 (7.91%) missing values [Missing]
amai_score has 11 (7.91%) missing values [Missing]
notes has 139 (100.0%) missing values [Missing]
occup is skewed [Skewed]
income is skewed [Skewed]
bro_num is skewed [Skewed]
place_bro is skewed [Skewed]
years_mental is skewed [Skewed]
support_years is skewed [Skewed]
work_threeyears is skewed [Skewed]
ed_score is skewed [Skewed]
group has constant length 1 [Constant Length]
demo has constant length 1 [Constant Length]
sex has constant length 3 [Constant Length]
educ has constant length 3 [Constant Length]
civil_st has constant length 3 [Constant Length]
child has constant length 3 [Constant Length]
child_num has constant length 3 [Constant Length]
support_ever has constant length 3 [Constant Length]
work_thirtydays has constant length 3 [Constant Length]
energy_freq has constant length 3 [Constant Length]
energy_recent has constant length 3 [Constant Length]
energy_cans has constant length 3 [Constant Length]
laterality has constant length 3 [Constant Length]
amai has constant length 3 [Constant Length]
notes has all distinct values [Unique]
ed_score has 5 (3.6%) negatives [Negatives]
occup has 8 (5.76%) zeros [Zeros]
income has 11 (7.91%) zeros [Zeros]
years_mental has 51 (36.69%) zeros [Zeros]
support_years has 76 (54.68%) zeros [Zeros]
Variables
rid
numerical
Approximate Distinct Count
139
Approximate Unique (%)
100.0%
Missing
0
Missing (%)
0.0%
Infinite
0
Infinite (%)
0.0%
Memory Size
2224
Mean
75.2734
Minimum
1
Maximum
160
Zeros
0
Zeros (%)
0.0%
Negatives
0
Negatives (%)
0.0%
rid is uniformly distributed
rid is skewed right (γ1 = 0.1019)
Quantile Statistics
Minimum
1
5-th Percentile
7.9
Q1
35.5
Median
73
Q3
114.5
95-th Percentile
146.1
Maximum
160
Range
159
IQR
79
Descriptive Statistics
Mean
75.2734
Standard Deviation
45.5645
Variance
2076.1276
Sum
10463
Skewness
0.1019
Kurtosis
-1.2022
Coefficient of Variation
0.6053
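Several of the figures in a numerical block are derived from the others, which gives a quick way to sanity-check the report. The sketch below recomputes the interquartile range, the range, and the coefficient of variation for rid from the values listed above. The definitions used (IQR = Q3 - Q1, CV = standard deviation / mean) are the standard ones and are assumed, not confirmed, to be what DataPrep applies.

```python
import math

# Values copied from the rid block of the report above.
mean, std, variance = 75.2734, 45.5645, 2076.1276
minimum, q1, q3, maximum = 1, 35.5, 114.5, 160

iqr = q3 - q1                    # interquartile range
value_range = maximum - minimum  # spread between the extremes
cv = std / mean                  # coefficient of variation

print(f"IQR = {iqr}, range = {value_range}, CV = {cv:.4f}")
print(f"sqrt(variance) = {math.sqrt(variance):.4f} vs reported std = {std}")
```

The recomputed values agree with the reported IQR, Range, and Coefficient of Variation up to rounding.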
group
categorical
Approximate Distinct Count
2
Approximate Unique (%)
1.4%
Missing
0
Missing (%)
0.0%
Memory Size
9174
Length
Mean
1
Standard Deviation
0
Median
1
Minimum
1
Maximum
1
Sample
1st row
1
2nd row
2
3rd row
2
4th row
1
5th row
1
Letter
Count
0
Lowercase Letter
0
Space Separator
0
Uppercase Letter
0
Dash Punctuation
0
Decimal Number
139
The top 2 categories (2, 1) take over 50.0%
group has words of constant length
demo
categorical
Approximate Distinct Count
2
Approximate Unique (%)
1.4%
Missing
0
Missing (%)
0.0%
Memory Size
9174
The largest value (1) is over 4.15 times larger than the second
largest value (0)
Length
Mean
1
Standard Deviation
0
Median
1
Minimum
1
Maximum
1
Sample
1st row
1
2nd row
1
3rd row
1
4th row
1
5th row
1
Letter
Count
0
Lowercase Letter
0
Space Separator
0
Uppercase Letter
0
Dash Punctuation
0
Decimal Number
139
The top 2 categories (1, 0) take over 50.0%
The largest value (1) is over 4.15 times larger than the second
largest value (0)
demo has words of constant length
sex
categorical
Approximate Distinct Count
2
Approximate Unique (%)
1.5%
Missing
2
Missing (%)
1.4%
Memory Size
9316
The largest value (1.0) is over 5.85 times larger than the second
largest value (2.0)
Length
Mean
3
Standard Deviation
0
Median
3
Minimum
3
Maximum
3
Sample
1st row
1.0
2nd row
1.0
3rd row
1.0
4th row
1.0
5th row
1.0
Letter
Count
0
Lowercase Letter
0
Space Separator
0
Uppercase Letter
0
Dash Punctuation
0
Decimal Number
274
The top 2 categories (1.0, 2.0) take over 50.0%
The largest value (1.0) is over 5.85 times larger than the second largest value (2.0)
sex has words of constant length
age
numerical
Approximate Distinct Count
31
Approximate Unique (%)
22.6%
Missing
2
Missing (%)
1.4%
Infinite
0
Infinite (%)
0.0%
Memory Size
2192
Mean
30.7518
Minimum
18
Maximum
50
Zeros
0
Zeros (%)
0.0%
Negatives
0
Negatives (%)
0.0%
age is skewed right (γ1 = 0.2998)
Quantile Statistics
Minimum
18
5-th Percentile
19.8
Q1
24
Median
30
Q3
37
95-th Percentile
44
Maximum
50
Range
32
IQR
13
Descriptive Statistics
Mean
30.7518
Standard Deviation
7.7039
Variance
59.3497
Sum
4213
Skewness
0.2998
Kurtosis
-0.8158
Coefficient of Variation
0.2505
educ
categorical
Approximate Distinct Count
6
Approximate Unique (%)
4.5%
Missing
6
Missing (%)
4.3%
Memory Size
9044
Length
Mean
3
Standard Deviation
0
Median
3
Minimum
3
Maximum
3
Sample
1st row
7.0
2nd row
5.0
3rd row
5.0
4th row
5.0
5th row
5.0
Letter
Count
0
Lowercase Letter
0
Space Separator
0
Uppercase Letter
0
Dash Punctuation
0
Decimal Number
266
The top 2 categories (2.0, 5.0) take over 50.0%
educ has words of constant length
occup
numerical
Approximate Distinct Count
10
Approximate Unique (%)
7.6%
Missing
8
Missing (%)
5.8%
Infinite
0
Infinite (%)
0.0%
Memory Size
2096
Mean
3.6107
Minimum
0
Maximum
9
Zeros
8
Zeros (%)
5.8%
Negatives
0
Negatives (%)
0.0%
occup is skewed right (γ1 = 0.4671)
Quantile Statistics
Minimum
0
5-th Percentile
0
Q1
3
Median
3
Q3
4
95-th Percentile
7
Maximum
9
Range
9
IQR
1
Descriptive Statistics
Mean
3.6107
Standard Deviation
2.0175
Variance
4.0703
Sum
473
Skewness
0.4671
Kurtosis
0.1152
Coefficient of Variation
0.5588
occup is not normally distributed (p-value 1.3499206683604491e-18)
occup has 48 outliers
income
numerical
Approximate Distinct Count
39
Approximate Unique (%)
36.8%
Missing
33
Missing (%)
23.7%
Infinite
0
Infinite (%)
0.0%
Memory Size
1696
Mean
6309.6226
Minimum
0
Maximum
50000
Zeros
11
Zeros (%)
7.9%
Negatives
0
Negatives (%)
0.0%
income is skewed right (γ1 = 3.739)
Quantile Statistics
Minimum
0
5-th Percentile
0
Q1
2800
Median
4900
Q3
8000
95-th Percentile
16000
Maximum
50000
Range
50000
IQR
5200
Descriptive Statistics
Mean
6309.6226
Standard Deviation
6834.7511
Variance
4.6714e+07
Sum
668820
Skewness
3.739
Kurtosis
19.2041
Coefficient of Variation
1.0832
income is not normally distributed (p-value 6.339218029230788e-09)
income has 7 outliers
civil_st
categorical
Approximate Distinct Count
6
Approximate Unique (%)
5.5%
Missing
30
Missing (%)
21.6%
Memory Size
7412
The largest value (6.0) is over 1.63 times larger than the second
largest value (2.0)
Length
Mean
3
Standard Deviation
0
Median
3
Minimum
3
Maximum
3
Sample
1st row
1.0
2nd row
6.0
3rd row
6.0
4th row
6.0
5th row
6.0
Letter
Count
0
Lowercase Letter
0
Space Separator
0
Uppercase Letter
0
Dash Punctuation
0
Decimal Number
218
The top 2 categories (6.0, 2.0) take over 50.0%
The largest value (6.0) is over 1.63 times larger than the second largest value (2.0)
civil_st has words of constant length
child
categorical
Approximate Distinct Count
2
Approximate Unique (%)
1.9%
Missing
32
Missing (%)
23.0%
Memory Size
7276
Length
Mean
3
Standard Deviation
0
Median
3
Minimum
3
Maximum
3
Sample
1st row
0.0
2nd row
0.0
3rd row
0.0
4th row
0.0
5th row
0.0
Letter
Count
0
Lowercase Letter
0
Space Separator
0
Uppercase Letter
0
Dash Punctuation
0
Decimal Number
214
The top 2 categories (1.0, 0.0) take over 50.0%
child has words of constant length
child_num
categorical
Approximate Distinct Count
5
Approximate Unique (%)
4.8%
Missing
35
Missing (%)
25.2%
Memory Size
7072
The largest value (0.0) is over 2.27 times larger than the second
largest value (1.0)
Length
Mean
3
Standard Deviation
0
Median
3
Minimum
3
Maximum
3
Sample
1st row
0.0
2nd row
0.0
3rd row
0.0
4th row
0.0
5th row
0.0
Letter
Count
0
Lowercase Letter
0
Space Separator
0
Uppercase Letter
0
Dash Punctuation
0
Decimal Number
208
The top 2 categories (0.0, 1.0) take over 50.0%
The largest value (0.0) is over 2.27 times larger than the second largest value (1.0)
child_num has words of constant length
bro_num
numerical
Approximate Distinct Count
10
Approximate Unique (%)
10.1%
Missing
40
Missing (%)
28.8%
Infinite
0
Infinite (%)
0.0%
Memory Size
1584
Mean
2.6869
Minimum
0
Maximum
10
Zeros
5
Zeros (%)
3.6%
Negatives
0
Negatives (%)
0.0%
bro_num is skewed right (γ1 = 1.5009)
Quantile Statistics
Minimum
0
5-th Percentile
0.9
Q1
2
Median
2
Q3
3
95-th Percentile
7
Maximum
10
Range
10
IQR
1
Descriptive Statistics
Mean
2.6869
Standard Deviation
1.7359
Variance
3.0132
Sum
266
Skewness
1.5009
Kurtosis
3.4136
Coefficient of Variation
0.6461
bro_num is not normally distributed (p-value 6.6190821548576405e-15)
bro_num has 14 outliers
place_bro
numerical
Approximate Distinct Count
9
Approximate Unique (%)
8.7%
Missing
36
Missing (%)
25.9%
Infinite
0
Infinite (%)
0.0%
Memory Size
1648
Mean
2.068
Minimum
0
Maximum
9
Zeros
5
Zeros (%)
3.6%
Negatives
0
Negatives (%)
0.0%
place_bro is skewed right (γ1 = 1.9142)
Quantile Statistics
Minimum
0
5-th Percentile
1
Q1
1
Median
2
Q3
3
95-th Percentile
4.9
Maximum
9
Range
9
IQR
2
Descriptive Statistics
Mean
2.068
Standard Deviation
1.5672
Variance
2.4561
Sum
213
Skewness
1.9142
Kurtosis
4.6291
Coefficient of Variation
0.7578
place_bro is not normally distributed (p-value 3.1701947108018384e-17)
place_bro has 4 outliers
prof_mental
categorical
Approximate Distinct Count
20
Approximate Unique (%)
20.6%
Missing
42
Missing (%)
30.2%
Memory Size
2624
The largest value (0) is over 2.79 times larger than the second
largest value (4)
Length
Mean
1.567
Standard Deviation
1.3987
Median
1
Minimum
1
Maximum
11
Sample
1st row
0
2nd row
0
3rd row
3
4th row
0
5th row
0
Letter
Count
0
Lowercase Letter
0
Space Separator
0
Uppercase Letter
0
Dash Punctuation
0
Decimal Number
126
The largest value (0) is over 2.79 times larger than the second
largest value (4)
years_mental
numerical
Approximate Distinct Count
17
Approximate Unique (%)
17.7%
Missing
43
Missing (%)
30.9%
Infinite
0
Infinite (%)
0.0%
Memory Size
1536
Mean
2.0031
Minimum
0
Maximum
20
Zeros
51
Zeros (%)
36.7%
Negatives
0
Negatives (%)
0.0%
years_mental is skewed right (γ1 = 2.7174)
Quantile Statistics
Minimum
0
5-th Percentile
0
Q1
0
Median
0
Q3
2
95-th Percentile
10
Maximum
20
Range
20
IQR
2
Descriptive Statistics
Mean
2.0031
Standard Deviation
4.0109
Variance
16.087
Sum
192.3
Skewness
2.7174
Kurtosis
7.3727
Coefficient of Variation
2.0023
years_mental is not normally distributed (p-value 6.298570799989492e-23)
years_mental has 14 outliers
hosp_subst
categorical
Approximate Distinct Count
8
Approximate Unique (%)
8.7%
Missing
47
Missing (%)
33.8%
Memory Size
6257
The largest value (0.0) is over 7.33 times larger than the second
largest value (1.0)
Length
Mean
3.0109
Standard Deviation
0.1043
Median
3
Minimum
3
Maximum
4
Sample
1st row
0.0
2nd row
0.0
3rd row
0.0
4th row
0.0
5th row
0.0
Letter
Count
0
Lowercase Letter
0
Space Separator
0
Uppercase Letter
0
Dash Punctuation
0
Decimal Number
185
The top 2 categories (0.0, 1.0) take over 50.0%
The largest value (0.0) is over 7.33 times larger than the second largest value (1.0)
support_ever
categorical
Approximate Distinct Count
3
Approximate Unique (%)
3.0%
Missing
40
Missing (%)
28.8%
Memory Size
6732
The largest value (0.0) is over 2.5 times larger than the second
largest value (1.0)
Length
Mean
3
Standard Deviation
0
Median
3
Minimum
3
Maximum
3
Sample
1st row
0.0
2nd row
0.0
3rd row
0.0
4th row
0.0
5th row
0.0
Letter
Count
0
Lowercase Letter
0
Space Separator
0
Uppercase Letter
0
Dash Punctuation
0
Decimal Number
198
The top 2 categories (0.0, 1.0) take over 50.0%
The largest value (0.0) is over 2.5 times larger than the second largest value (1.0)
support_ever has words of constant length
support_years
numerical
Approximate Distinct Count
11
Approximate Unique (%)
11.2%
Missing
41
Missing (%)
29.5%
Infinite
0
Infinite (%)
0.0%
Memory Size
1568
Mean
0.4537
Minimum
0
Maximum
10
Zeros
76
Zeros (%)
54.7%
Negatives
0
Negatives (%)
0.0%
support_years is skewed right (γ1 = 4.5604)
Quantile Statistics
Minimum
0
5-th Percentile
0
Q1
0
Median
0
Q3
0
95-th Percentile
3.15
Maximum
10
Range
10
IQR
0
Descriptive Statistics
Mean
0.4537
Standard Deviation
1.4104
Variance
1.9893
Sum
44.46
Skewness
4.5604
Kurtosis
23.4284
Coefficient of Variation
3.1089
support_years is not normally distributed (p-value 1.1899573044062926e-24)
support_years has 22 outliers
work_thirtydays
categorical
Approximate Distinct Count
8
Approximate Unique (%)
9.9%
Missing
58
Missing (%)
41.7%
Memory Size
5508
Length
Mean
3
Standard Deviation
0
Median
3
Minimum
3
Maximum
3
Sample
1st row
1.0
2nd row
1.0
3rd row
4.0
4th row
2.0
5th row
2.0
Letter
Count
0
Lowercase Letter
0
Space Separator
0
Uppercase Letter
0
Dash Punctuation
0
Decimal Number
162
work_thirtydays has words of constant length
work_threeyears
numerical
Approximate Distinct Count
9
Approximate Unique (%)
10.8%
Missing
56
Missing (%)
40.3%
Infinite
0
Infinite (%)
0.0%
Memory Size
1328
Mean
2.5229
Minimum
0
Maximum
7
Zeros
1
Zeros (%)
0.7%
Negatives
0
Negatives (%)
0.0%
work_threeyears is skewed right (γ1 = 0.9157)
Quantile Statistics
Minimum
0
5-th Percentile
1
Q1
1
Median
2
Q3
3
95-th Percentile
5
Maximum
7
Range
7
IQR
2
Descriptive Statistics
Mean
2.5229
Standard Deviation
1.4999
Variance
2.2496
Sum
209.4
Skewness
0.9157
Kurtosis
0.5892
Coefficient of Variation
0.5945
work_threeyears is not normally distributed (p-value 5.2985257588298194e-14)
work_threeyears has 2 outliers
energy_freq
categorical
Approximate Distinct Count
3
Approximate Unique (%)
3.0%
Missing
40
Missing (%)
28.8%
Memory Size
6732
The largest value (0.0) is over 6.75 times larger than the second
largest value (1.0)
Length
Mean
3
Standard Deviation
0
Median
3
Minimum
3
Maximum
3
Sample
1st row
0.0
2nd row
0.0
3rd row
0.0
4th row
0.0
5th row
0.0
Letter
Count
0
Lowercase Letter
0
Space Separator
0
Uppercase Letter
0
Dash Punctuation
0
Decimal Number
198
The top 2 categories (0.0, 1.0) take over 50.0%
The largest value (0.0) is over 6.75 times larger than the second largest value (1.0)
energy_freq has words of constant length
energy_recent
categorical
Approximate Distinct Count
2
Approximate Unique (%)
2.0%
Missing
40
Missing (%)
28.8%
Memory Size
6732
The largest value (0.0) is over 5.19 times larger than the second
largest value (1.0)
Length
Mean
3
Standard Deviation
0
Median
3
Minimum
3
Maximum
3
Sample
1st row
0.0
2nd row
0.0
3rd row
0.0
4th row
0.0
5th row
0.0
Letter
Count
0
Lowercase Letter
0
Space Separator
0
Uppercase Letter
0
Dash Punctuation
0
Decimal Number
198
The top 2 categories (0.0, 1.0) take over 50.0%
The largest value (0.0) is over 5.19 times larger than the second largest value (1.0)
energy_recent has words of constant length
energy_cans
categorical
Approximate Distinct Count
4
Approximate Unique (%)
4.1%
Missing
41
Missing (%)
29.5%
Memory Size
6664
The largest value (0.0) is over 10.38 times larger than the second
largest value (1.0)
Length
Mean
3
Standard Deviation
0
Median
3
Minimum
3
Maximum
3
Sample
1st row
0.0
2nd row
0.0
3rd row
0.0
4th row
0.0
5th row
0.0
Letter
Count
0
Lowercase Letter
0
Space Separator
0
Uppercase Letter
0
Dash Punctuation
0
Decimal Number
196
The top 2 categories (0.0, 1.0) take over 50.0%
The largest value (0.0) is over 10.38 times larger than the second largest value (1.0)
energy_cans has words of constant length
laterality
categorical
Approximate Distinct Count
3
Approximate Unique (%)
2.2%
Missing
2
Missing (%)
1.4%
Memory Size
9316
The largest value (1.0) is over 11.9 times larger than the second
largest value (2.0)
Length
Mean
3
Standard Deviation
0
Median
3
Minimum
3
Maximum
3
Sample
1st row
1.0
2nd row
1.0
3rd row
2.0
4th row
1.0
5th row
1.0
Letter
Count
0
Lowercase Letter
0
Space Separator
0
Uppercase Letter
0
Dash Punctuation
0
Decimal Number
274
The top 2 categories (1.0, 2.0) take over 50.0%
The largest value (1.0) is over 11.9 times larger than the second largest value (2.0)
laterality has words of constant length
ed_score
numerical
Approximate Distinct Count
9
Approximate Unique (%)
6.7%
Missing
4
Missing (%)
2.9%
Infinite
0
Infinite (%)
0.0%
Memory Size
2160
Mean
86.2037
Minimum
-100
Maximum
100
Zeros
2
Zeros (%)
1.4%
Negatives
5
Negatives (%)
3.6%
ed_score is skewed left (γ1 = -3.7682)
Quantile Statistics
Minimum
-100
5-th Percentile
17.5
Q1
87.5
Median
100
Q3
100
95-th Percentile
100
Maximum
100
Range
200
IQR
12.5
Descriptive Statistics
Mean
86.2037
Standard Deviation
38.6416
Variance
1493.1765
Sum
11637.5
Skewness
-3.7682
Kurtosis
13.8255
Coefficient of Variation
0.4483
ed_score is not normally distributed (p-value 4.930280310580592e-24)